Using Thesauri for Automatic Indexing and for the Visualisation of Multilingual Document Collections

Authors

  • Ralf Steinberger
  • Johan Hagman
  • Stefan Scheer
Abstract

This article presents an approach to cross-language document comparison and to the visualisation of multilingual document collections. Document comparison usually relies on calculating the degree of lexical overlap between documents. As this is not possible for documents written in different languages, their contents first have to be mapped onto a language-independent representation. The JRC's statistical tool for controlled-vocabulary keyword assignment assigns descriptors from the multilingual Eurovoc thesaurus, and these descriptors can be used for cross-language document comparison. The language-independent sets of thesaurus descriptors make it possible to identify, for a given document, the most similar documents even if they are written in different languages. They also make it possible to organise and visualise the structure and approximate contents of whole multilingual document collections in two-dimensional document maps.
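The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes each document has already been reduced to a set of weighted thesaurus descriptors, and it uses cosine similarity as the comparison measure; the abstract does not name the measure the authors use, so this is a stand-in. The descriptor identifiers (`D001` and so on), the weights, the document names, and the function names are all hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two sparse descriptor-weight vectors.

    Each vector is a dict mapping a descriptor identifier to a weight.
    Cosine similarity is an assumed measure, not one confirmed by the source.
    """
    shared = set(a) & set(b)
    dot = sum(a[d] * b[d] for d in shared)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, collection, top_n=3):
    """Rank the other documents in the collection by similarity to the query."""
    query = collection[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in collection.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical descriptor assignments for documents in three languages.
# Because descriptors are language-independent identifiers, the documents
# can be compared directly even though their texts share no vocabulary.
collection = {
    "doc_en": {"D001": 0.9, "D002": 0.4, "D003": 0.1},
    "doc_de": {"D001": 0.8, "D002": 0.5},
    "doc_fr": {"D004": 0.7, "D005": 0.6},
}

if __name__ == "__main__":
    for doc_id, score in most_similar("doc_en", collection):
        print(f"{doc_id}: {score:.3f}")
```

In this sketch the language of each document plays no role once descriptors are assigned, which is the property the abstract relies on for both retrieval of similar documents and the construction of document maps.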

Similar resources

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...

Automatic annotation of multilingual text collections with a conceptual thesaurus

Automatic annotation of documents with controlled vocabulary terms (descriptors) from a conceptual thesaurus is not only useful for document indexing and retrieval. The mapping of texts onto the same thesaurus furthermore allows to establish links between similar documents. This is also a substantial requirement of the Semantic Web. This paper presents an almost language-independent system that...

Automatic Multi-label Subject Indexing in a Multilingual Environment

This paper presents an approach to automatically subject index fulltext documents with multiple labels based on binary support vector machines (SVM). The aim was to test the applicability of SVMs with a real world dataset. We have also explored the feasibility of incorporating multilingual background knowledge, as represented in thesauri or ontologies, into our text document representation for ...

Semantic Indexing of Multilingual Corpora and its Application on the History Domain

The increasing amount of multilingual text collections available in different domains makes its automatic processing essential for the development of a given field. However, standard processing techniques based on statistical clues and keyword searches have clear limitations. Instead, we propose a knowledge-based processing pipeline which overcomes most of the limitations of these techniques. T...

Automatic Multilingual Indexing and Natural Language Processing

The number of documents being collected by information brokers such as bibliographic database producers, libraries and publishers increases rapidly. The consequence is a huge demand for indexing and classification. So far this has had to be carried out manually. The system AUTINDEX, which is described in this paper offers tools for monolingual as well as for multilingual automatic indexing and ...

Publication date: 2000